This lab follows the same format as the previous ones. Your goal is to predict the value of the third column (which will be missing on the test set) using the techniques we have learned so far.
Read in the following libraries and load the tract median income dataset:
library(readr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(viridis)
## Loading required package: viridisLite
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-13
library(ggmap)
library(GGally)
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(FNN)
library(gam)
## Loading required package: splines
## Loaded gam 1.14-4
tract <- read_csv("https://statsmaths.github.io/ml_data/tract_median_income.csv")
## Parsed with column specification:
## cols(
## .default = col_double(),
## obs_id = col_character(),
## train_id = col_character(),
## state = col_character(),
## county = col_character(),
## cbsa_name = col_character(),
## cbsa_type = col_character(),
## total_population = col_integer(),
## total_households = col_integer()
## )
## See spec(...) for full column specifications.
This lab will be scored using RMSE.
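Since that is the metric, a small helper (a sketch; the function name and the toy values are my own) makes the evaluation explicit. The `na.rm = TRUE` matters here because the test rows have a missing response:

```r
# Root mean squared error; na.rm = TRUE so rows without a
# response (the test set) do not turn the result into NA.
rmse <- function(y, y_pred) {
  sqrt(mean((y - y_pred)^2, na.rm = TRUE))
}

rmse(c(1, 2, 3, NA), c(1, 2, 5, 0))  # toy values: sqrt(4/3)
```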
To start, let us do a few plots to help us better picture what we are trying to predict, as well as the data subsets.
qmplot(lon, lat, data = tract,
       color = train_id, size = I(0.5))
## Using zoom = 3...
## Map from URL : http://tile.stamen.com/toner-lite/3/0/1.png
## Map from URL : http://tile.stamen.com/toner-lite/3/1/1.png
## Map from URL : http://tile.stamen.com/toner-lite/3/2/1.png
## Map from URL : http://tile.stamen.com/toner-lite/3/0/2.png
## Map from URL : http://tile.stamen.com/toner-lite/3/1/2.png
## Map from URL : http://tile.stamen.com/toner-lite/3/2/2.png
## Map from URL : http://tile.stamen.com/toner-lite/3/0/3.png
## Map from URL : http://tile.stamen.com/toner-lite/3/1/3.png
## Map from URL : http://tile.stamen.com/toner-lite/3/2/3.png
## Warning: `panel.margin` is deprecated. Please use `panel.spacing` property
## instead
The plot of income quartiles suggests they are good predictors of median income. The quartiles are also very highly correlated with one another, which may indicate that we only need one or a few of them as predictors. The same holds for the age variables, although their correlations with each other are not as strong as those of the income quartiles.
qplot(income_q1, median_income, color = cbsa_type, data=tract)
## Warning: Removed 12799 rows containing missing values (geom_point).
income_frame <- select(tract, income_q1, income_q2, income_q3,
                       income_q4, income_p95)
ggpairs(income_frame)
qplot(age_15_18, median_income, color = cbsa_type, data=tract)
## Warning: Removed 12799 rows containing missing values (geom_point).
age_frame <- select(tract, age_00_02, age_03_04, age_05_05,
                    age_06_08, age_09_11, age_12_14, age_15_18)
ggpairs(age_frame)
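To put numbers behind the ggpairs impression, the correlations can also be computed directly (a sketch; it assumes tract is loaded as above, and uses pairwise-complete rows to cope with the missing values noted in the warnings):

```r
# Correlation matrix of the income quartile variables,
# ignoring missing values pair by pair.
income_vars <- c("income_q1", "income_q2", "income_q3",
                 "income_q4", "income_p95")
round(cor(tract[, income_vars], use = "pairwise.complete.obs"), 2)
```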
Next, I made a few exploratory graphs of the various variables, but only kept those that seemed to have some effect (or potential effect) on median income:
- As avg_duration gets farther from zero, median incomes tend to fall.
- car_alone and public_transit seem to be good predictors of median income.
- Healthcare coverage, especially private healthcare, also seems significant in predicting median income.
- Housing costs have a direct correlation with median income, as does gini, the measure of inequality.
qplot(avg_duration, median_income, data=tract)
## Warning: Removed 12799 rows containing missing values (geom_point).
qplot(car_alone, median_income, data=tract)
## Warning: Removed 12799 rows containing missing values (geom_point).
qplot(public_transit, median_income, data=tract)
## Warning: Removed 12799 rows containing missing values (geom_point).
qplot(healthcare_private, median_income, data=tract)
## Warning: Removed 12799 rows containing missing values (geom_point).
qplot(housing_costs, median_income, data=tract)
## Warning: Removed 12799 rows containing missing values (geom_point).
qplot(median_income, gini, data=tract)
## Warning: Removed 12799 rows containing missing values (geom_point).
Next, I include many of the variables identified above in an elastic net and tune the penalty parameter to determine which variables I may want to keep in my final model. After cycling through various values of lambda (down to the 50th value in the sequence), I found that only three variables came out with nonzero coefficients.
X1 <- as.matrix(tract[,10:70])
y1 <- tract$median_income
X_train <- X1[tract$train_id == "train",]
y_train <- y1[tract$train_id == "train"]
model <- cv.glmnet(X_train, y_train, alpha = 0.9)
model$lambda
## [1] 31804.6840 28979.2436 26404.8074 24059.0771 21921.7350 19974.2685
## [7] 18199.8095 16582.9886 15109.8016 13767.4885 12544.4228 11430.0108
## [13] 10414.6002 9489.3958 8646.3840 7878.2631 7178.3799 6540.6724
## [19] 5959.6171 5430.1812 4947.7788 4508.2318 4107.7329 3742.8132
## [25] 3410.3120 3107.3493 2831.3010 2579.7760 2350.5958 2141.7753
## [31] 1951.5059 1778.1395 1620.1745 1476.2427 1345.0973 1225.6026
## [37] 1116.7234 1017.5168 927.1234 844.7603 769.7141 701.3349
## [43] 639.0302 582.2605 530.5341 483.4029 440.4587 401.3296
## [49] 365.6766 333.1909 303.5911 276.6209 252.0467 229.6555
coef(model, s = model$lambda[50])
## 62 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 463.48519407
## total_population .
## total_households .
## income_q1 .
## income_q2 0.58688980
## income_q3 0.42313607
## income_q4 .
## income_p95 .
## median_age .
## age_00_02 .
## age_03_04 .
## age_05_05 .
## age_06_08 .
## age_09_11 .
## age_12_14 .
## age_15_18 .
## same_house .
## same_county .
## same_state .
## avg_duration .
## car_alone .
## carpool .
## public_transit .
## walked .
## transit_other .
## at_home .
## white_alone .
## black_alone .
## native_american_alone .
## asian_alone .
## pacific_alone .
## other_alone .
## multiracial .
## healthcare_private .
## healthcare_public .
## healthcare_none .
## housing_costs 0.01097553
## housing_000000_000000 .
## housing_010000_014999 .
## housing_015000_019999 .
## housing_020000_024999 .
## housing_025000_029999 .
## housing_030000_034999 .
## housing_035000_039999 .
## housing_040000_049999 .
## housing_050000_059999 .
## housing_060000_069999 .
## housing_070000_079999 .
## housing_080000_089999 .
## housing_090000_099999 .
## housing_100000_124999 .
## housing_125000_149999 .
## housing_150000_174999 .
## housing_175000_199999 .
## housing_200000_249999 .
## housing_250000_299999 .
## housing_300000_399999 .
## housing_400000_499999 .
## housing_500000_749999 .
## housing_750000_999999 .
## housing_above_1_million .
## gini .
plot(model)
model2_pred <- predict(model, newx = X1, s = "lambda.1se")
sqrt(tapply((tract$median_income - model2_pred)^2,
tract$train_id, mean))
## test train valid
## NA 3154.908 3197.087
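Rather than scanning the full coefficient table by eye, the nonzero terms can be pulled out programmatically (a sketch; it assumes the cv.glmnet fit `model` from above):

```r
# coef() on a cv.glmnet fit returns a sparse one-column matrix;
# keep only the rows with nonzero coefficients.
co <- coef(model, s = "lambda.1se")
nonzero <- rownames(co)[as.vector(co) != 0]
nonzero
```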
Next, let us see how well we can capture the lon and lat variables using the knn.reg function. We find that the model does a relatively good job capturing the variation in incomes. Below is also a function that tunes the k value for our knn regression. It makes sense to tune over low values of k, since median incomes can vary significantly from one (lat, lon) combination to another. We find that k = 5 leads to the lowest validation-set RMSE, so we use that in our model.
X <- as.matrix(select(tract, lon, lat))
y <- as.matrix(tract$median_income)
X_train <- X[tract$train_id == "train",]
y_train <- y[tract$train_id == "train"]
tuning_function <- function() {
  # Track the best validation RMSE seen so far and the k that produced it.
  # Starting at Inf (rather than an arbitrary cutoff) guarantees the first
  # candidate is always accepted.
  result_final <- Inf
  i_final <- 1
  for (i in 1:30) {
    mod <- knn.reg(train = X_train, test = X, y = y_train, k = i)
    result <- sqrt(tapply((tract$median_income - mod$pred)^2,
                          tract$train_id, mean))["valid"]
    if (result < result_final) {
      result_final <- result
      i_final <- i
    }
  }
  i_final
}
result <- tuning_function()
result
## [1] 5
model1 <- knn.reg(train = X_train, test = X, y = y_train, k = result)
model1_pred <- model1$pred
sqrt(tapply((tract$median_income - model1_pred)^2,
tract$train_id, mean))
## test train valid
## NA 15101.52 19458.68
Finally, let us try combining these two sets of predictions in a gam to evaluate whether doing so improves our RMSE. I also added a few more variables that very slightly reduced my model's RMSE and make intuitive sense to include. This will be my final model.
model3 <- gam(median_income ~ s(model1_pred) + s(model2_pred) + s(gini) +
                s(housing_above_1_million) + s(healthcare_private),
              subset = train_id == "train", data = tract)
model3_pred <- predict(model3, newdata = tract)
sqrt(tapply((tract$median_income - model3_pred)^2,
tract$train_id, mean))
## test train valid
## NA 3089.978 3127.678
The code below assumes that you have added a prediction named median_income_pred to every row of the dataset.
tract$median_income_pred <- model3_pred
submit <- select(tract, obs_id, median_income_pred)
write_csv(submit, "class09_submit.csv")
Now, upload this file (ends with “.Rmd”), the HTML output (ends with “.nb.html” or “.html”), and the csv file to GitHub.